synthetic video



Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising

Wang, Yifan, Ji, Liya, Ke, Zhanghan, Yang, Harry, Lim, Ser-Nam, Chen, Qifeng

arXiv.org Artificial Intelligence

We propose an approach to enhancing synthetic video realism that re-renders synthetic videos from a simulator in a photorealistic fashion. Our method is a zero-shot framework built on a video diffusion foundation model without further fine-tuning, designed to preserve the multi-level structure of the synthetic video in the enhanced output across both the spatial and temporal domains. Specifically, we modify the generation/denoising process to condition on structure-aware information estimated from the synthetic video by an auxiliary model, such as depth maps, semantic maps, and edge maps, rather than extracted from the simulator. This guidance keeps the enhanced videos consistent with the original synthetic video at both the structural and semantic levels. The result is a simple yet general and powerful approach to enhancing synthetic video realism: in our experiments, it outperforms existing baselines in structural consistency with the original video while maintaining state-of-the-art photorealism.
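
The conditioning mechanism described above (denoising guided by depth, semantic, and edge maps estimated from the synthetic video) can be sketched in a few lines. The toy denoiser, structure encoder, and guidance blend below are illustrative stand-ins, not the authors' released model or code.

```python
# Hypothetical sketch (not the authors' code): one denoising step where the
# noise prediction is conditioned on structure maps (depth / semantics / edges)
# estimated from the synthetic frame, via a ControlNet-style feature residual.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureEncoder(nn.Module):
    """Encodes stacked structure maps into a feature residual."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
        )
    def forward(self, maps):
        return self.net(maps)

class ToyDenoiser(nn.Module):
    """Stand-in for a video diffusion U-Net; predicts noise from a latent frame."""
    def __init__(self, lat_ch=4, feat_ch=64):
        super().__init__()
        self.inp = nn.Conv2d(lat_ch, feat_ch, 3, padding=1)
        self.out = nn.Conv2d(feat_ch, lat_ch, 3, padding=1)
    def forward(self, z_t, struct_residual=None):
        h = self.inp(z_t)
        if struct_residual is not None:
            h = h + struct_residual          # inject structure guidance
        return self.out(F.silu(h))

def guided_step(denoiser, encoder, z_t, structure_maps, alpha_t, alpha_prev, scale=1.0):
    """One DDIM-like, epsilon-prediction step with structure guidance (illustrative)."""
    eps_uncond = denoiser(z_t)
    eps_cond = denoiser(z_t, encoder(structure_maps))
    eps = eps_uncond + scale * (eps_cond - eps_uncond)   # CFG-style blend
    z0_hat = (z_t - (1 - alpha_t).sqrt() * eps) / alpha_t.sqrt()
    return alpha_prev.sqrt() * z0_hat + (1 - alpha_prev).sqrt() * eps

if __name__ == "__main__":
    denoiser, encoder = ToyDenoiser(), StructureEncoder()
    z_t = torch.randn(1, 4, 32, 32)      # noisy latent frame
    maps = torch.randn(1, 3, 32, 32)     # stacked depth + semantic + edge maps
    z_prev = guided_step(denoiser, encoder, z_t, maps,
                         torch.tensor(0.5), torch.tensor(0.6))
    print(z_prev.shape)
```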


AURA: Development and Validation of an Augmented Unplanned Removal Alert System using Synthetic ICU Videos

Seo, Junhyuk, Moon, Hyeyoon, Jung, Kyu-Hwan, Oh, Namkee, Kim, Taerim

arXiv.org Artificial Intelligence

Unplanned extubation (UE)--the unintended removal of an airway tube--remains a critical patient safety concern in intensive care units (ICUs), often leading to severe complications or death. Real-time UE detection has been limited, largely due to the ethical and privacy challenges of obtaining annotated ICU video data. We propose Augmented Unplanned Removal Alert (AURA), a vision-based risk detection system developed and validated entirely on a fully synthetic video dataset. By leveraging text-to-video diffusion, we generated diverse and clinically realistic ICU scenarios capturing a range of patient behaviors and care contexts. The system applies pose estimation to identify two high-risk movement patterns: collision, defined as hand entry into spatial zones near airway tubes, and agitation, quantified by the velocity of tracked anatomical keypoints. Expert assessments confirmed the realism of the synthetic data, and performance evaluations showed high accuracy for collision detection and moderate performance for agitation recognition. This work demonstrates a novel pathway for developing privacy-preserving, reproducible patient safety monitoring systems with potential for deployment in intensive care settings.
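
The two rule-based risk signals can be made concrete with a short sketch. The keypoint layout, airway-tube zone radius, and thresholds below are hypothetical, not AURA's actual parameters.

```python
# Hypothetical sketch of the two risk signals described above:
# "collision" = a hand keypoint entering a spatial zone around the airway tube,
# "agitation" = high frame-to-frame velocity of tracked anatomical keypoints.
import numpy as np

def collision_alert(hand_xy, tube_xy, radius_px=60.0):
    """True if any hand keypoint falls inside a circular zone around the tube."""
    d = np.linalg.norm(np.asarray(hand_xy) - np.asarray(tube_xy), axis=-1)
    return bool((d < radius_px).any())

def agitation_score(keypoints_t, keypoints_prev, fps=30.0):
    """Mean keypoint speed (pixels per second) between consecutive frames."""
    v = np.linalg.norm(np.asarray(keypoints_t) - np.asarray(keypoints_prev), axis=-1) * fps
    return float(v.mean())

if __name__ == "__main__":
    tube = (320, 180)                       # assumed airway-tube location in the image
    hands_now = [(400, 300), (350, 200)]    # per-frame hand keypoints from pose estimation
    kps_now = np.random.rand(17, 2) * 640
    kps_prev = kps_now + np.random.randn(17, 2) * 2
    print("collision:", collision_alert(hands_now, tube))
    print("agitation px/s:", round(agitation_score(kps_now, kps_prev), 1))
    # A frame would be flagged as agitation when this score exceeds a tuned threshold.
```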


Generative deep learning for foundational video translation in ultrasound

Bhatnagar, Nikolina Tomic Roshni, Jain, Sarthak, Lau, Connor, Liu, Tien-Yu, Gambini, Laura, Arnaout, Rima

arXiv.org Artificial Intelligence

Department of Medicine, Division of Cardiology; Bakar Computational Health Sciences Institute; UCSF-UC Berkeley Joint Program in Computational Precision Health; Department of Radiology, Center for Intelligent Imaging; University of California, San Francisco. Keywords: medical imaging, video translation, deep learning, image synthesis, ultrasound. Abstract: Deep learning (DL) has the potential to revolutionize image acquisition and interpretation across medicine; however, attention to data imbalance and missingness is required. Ultrasound data presents a particular challenge because, in addition to different views and structures, it includes several sub-modalities -- such as greyscale and color flow Doppler (CFD) -- that are often imbalanced in clinical studies. Image translation can help balance datasets but has so far been challenging for ultrasound sub-modalities. Here, we present a generative method for ultrasound CFD-greyscale video translation, trained on 54,975 videos and tested on 8,368. The method leverages pixel-wise, adversarial, and perceptual losses and utilizes two networks: one for reconstructing anatomic structures and one for denoising to achieve realistic ultrasound imaging. Average pairwise SSIM between synthetic videos and ground truth was 0.91 ± 0.04. Synthetic videos performed indistinguishably from real ones in DL classification and segmentation tasks and when evaluated by blinded clinical experts: the F1 score was 0.9 for real and 0.89 for synthetic videos, and the Dice score between real and synthetic segmentations was 0.97. Overall clinician accuracy in distinguishing real vs. synthetic videos was 54 ± 6% (42-61%), indicating realistic synthetic videos. Although trained only on heart videos, the model worked well on ultrasound spanning several clinical domains (average SSIM 0.91 ± 0.05), demonstrating foundational abilities.
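
The loss composition named in the abstract (pixel-wise, adversarial, and perceptual terms) could look roughly like the sketch below; the tiny feature extractor, discriminator output, and loss weights are stand-ins, not the paper's networks or settings.

```python
# Illustrative composite generator loss for greyscale<->CFD translation, combining
# the three terms named in the abstract: pixel-wise, adversarial, and perceptual.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFeatures(nn.Module):
    """Stand-in for a pretrained perceptual network (e.g. VGG-style features)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

def generator_loss(fake, real, disc_fake_logits, feats,
                   w_pix=1.0, w_adv=0.1, w_perc=1.0):
    pixel = F.l1_loss(fake, real)                                    # pixel-wise term
    adversarial = F.binary_cross_entropy_with_logits(                # fool the discriminator
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    perceptual = F.l1_loss(feats(fake), feats(real))                 # feature-space term
    return w_pix * pixel + w_adv * adversarial + w_perc * perceptual

if __name__ == "__main__":
    feats = TinyFeatures()
    real, fake = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)  # greyscale frames
    disc_logits = torch.randn(2, 1)                                  # discriminator output on fake
    print(generator_loss(fake, real, disc_logits, feats).item())
```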


VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Li, Zongxia, Wu, Xiyang, Shi, Guangyao, Qin, Yubin, Du, Hongyang, Liu, Fuxiao, Zhou, Tianyi, Manocha, Dinesh, Boyd-Graber, Jordan Lee

arXiv.org Artificial Intelligence

Vision-Language Models (VLMs) have achieved strong results in video understanding, yet a key question remains: do they truly comprehend visual content or only learn shallow correlations between vision and language? Real visual understanding, especially of physics and common sense, is essential for AI systems that interact with the physical world. Current evaluations mostly use real-world videos similar to training data, so high benchmark scores may not reflect real reasoning ability. To address this, we propose negative-control tests using videos that depict physically impossible or logically inconsistent events. We introduce VideoHallu, a synthetic dataset of physics- and commonsense-violating scenes generated with Veo2, Sora, and Kling. It includes expert-annotated question-answer pairs across four categories of violations. Tests of leading VLMs (Qwen-2.5-VL, Video-R1, VideoChat-R1) show that, despite strong results on benchmarks such as MVBench and MMVU, they often miss these violations, exposing gaps in visual reasoning. Reinforcement learning fine-tuning on VideoHallu improves recognition of such violations without reducing standard benchmark performance. Our data is available at https://github.com/zli12321/VideoHallu.git.
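
A benchmark like this is typically scored by comparing model answers against the expert annotations; the sketch below shows one minimal way to do that, with the field names and the loose string-matching metric being assumptions rather than VideoHallu's official protocol.

```python
# Minimal sketch of scoring VLM answers on a VideoHallu-style QA set: each item
# pairs a synthetic video with an expert question/answer, and a model's free-form
# answer is checked against the reference. Field names here are assumptions.
from dataclasses import dataclass

@dataclass
class QAItem:
    video_path: str
    question: str
    reference: str      # expert-annotated answer
    prediction: str     # VLM output for (video, question)

def normalize(s: str) -> str:
    return " ".join(s.lower().strip().split())

def accuracy(items):
    """Loose match: reference contained in prediction, or vice versa."""
    hits = sum(
        normalize(i.reference) in normalize(i.prediction)
        or normalize(i.prediction) in normalize(i.reference)
        for i in items
    )
    return hits / max(len(items), 1)

if __name__ == "__main__":
    items = [
        QAItem("clip_001.mp4", "Does the glass shatter when it hits the floor?",
               "No, it bounces, which violates physics", "Yes, it shatters"),
        QAItem("clip_002.mp4", "Which direction does the smoke drift?",
               "upward", "The smoke drifts upward."),
    ]
    print(f"accuracy: {accuracy(items):.2f}")
```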



Echo-Path: Pathology-Conditioned Echo Video Generation

Muhammad, Kabir Hamzah, Elbatel, Marawan, Qin, Yi, Li, Xiaomeng

arXiv.org Artificial Intelligence

Cardiovascular diseases (CVDs) remain the leading cause of mortality globally, and echocardiography is critical for diagnosing both common and congenital cardiac conditions. However, echocardiographic data for certain pathologies are scarce, hindering the development of robust automated diagnosis models. In this work, we propose Echo-Path, a novel generative framework that produces echocardiogram videos conditioned on specific cardiac pathologies. Echo-Path can synthesize realistic ultrasound video sequences that exhibit targeted abnormalities, focusing here on atrial septal defect (ASD) and pulmonary arterial hypertension (PAH). Our approach introduces a pathology-conditioning mechanism into a state-of-the-art echo video generator, allowing the model to learn and control disease-specific structural and motion patterns in the heart. Quantitative evaluation demonstrates that the synthetic videos achieve low distribution distances, indicating high visual fidelity. Clinically, the generated echoes exhibit plausible pathology markers. Furthermore, classifiers trained on our synthetic data generalize well to real data, and when the synthetic videos are used to augment real training sets, downstream diagnosis of ASD and PAH improves by 7% and 8%, respectively. Code, weights, and dataset are available here.
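
A pathology-conditioning mechanism of the kind described above might be implemented by injecting a learned label embedding into each generator block; the FiLM-style toy generator below is an assumption for illustration, not Echo-Path's architecture.

```python
# Hypothetical sketch of pathology conditioning: a learned embedding for the target
# condition (e.g. ASD, PAH, normal) modulates each block of a video generator, so
# sampling can be steered toward a chosen pathology. Labels and layers are illustrative.
import torch
import torch.nn as nn

PATHOLOGIES = ["normal", "ASD", "PAH"]

class ConditionedBlock(nn.Module):
    def __init__(self, ch, cond_dim):
        super().__init__()
        self.conv = nn.Conv3d(ch, ch, 3, padding=1)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * ch)   # FiLM-style modulation
    def forward(self, x, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        scale = scale[..., None, None, None]
        shift = shift[..., None, None, None]
        return self.conv(x) * (1 + scale) + shift

class ToyEchoGenerator(nn.Module):
    def __init__(self, ch=8, cond_dim=32):
        super().__init__()
        self.embed = nn.Embedding(len(PATHOLOGIES), cond_dim)
        self.blocks = nn.ModuleList([ConditionedBlock(ch, cond_dim) for _ in range(2)])
        self.head = nn.Conv3d(ch, 1, 1)
    def forward(self, z, pathology_idx):
        cond = self.embed(pathology_idx)
        for blk in self.blocks:
            z = torch.relu(blk(z, cond))
        return torch.sigmoid(self.head(z))    # (B, 1, T, H, W) echo-like video

if __name__ == "__main__":
    gen = ToyEchoGenerator()
    z = torch.randn(1, 8, 16, 32, 32)                        # latent video
    video = gen(z, torch.tensor([PATHOLOGIES.index("ASD")]))
    print(video.shape)
```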


AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences

Li, Jieyu, Zhang, Xin, Zhou, Joey Tianyi

arXiv.org Artificial Intelligence

Recent advances in AI-generated content have fueled the rise of highly realistic synthetic videos, posing severe risks to societal trust and digital integrity. Existing benchmarks for video authenticity detection typically suffer from limited realism, insufficient scale, and inadequate complexity, and so fail to effectively evaluate modern vision-language models against sophisticated forgeries. To address this gap, we introduce AEGIS, a large-scale benchmark explicitly targeting the detection of hyper-realistic and semantically nuanced AI-generated videos. AEGIS comprises over 10,000 rigorously curated real and synthetic videos, the latter generated by diverse state-of-the-art models, including Stable Video Diffusion, CogVideoX-5B, Kling, and Sora, spanning open-source and proprietary architectures. In particular, AEGIS features specially constructed challenging subsets for robustness evaluation. Furthermore, we provide multimodal annotations spanning Semantic-Authenticity Descriptions, Motion Features, and Low-level Visual Features, facilitating authenticity detection and supporting downstream tasks such as multimodal fusion and forgery localization. Extensive experiments with advanced vision-language models show limited detection capability on the most challenging subsets of AEGIS, highlighting the dataset's complexity and realism beyond the generalization capabilities of existing models. AEGIS thus establishes a demanding evaluation benchmark for developing robust, reliable, and broadly generalizable video authenticity detection methods capable of addressing real-world forgery threats. Our dataset is available at https://huggingface.co/datasets/Clarifiedfish/AEGIS.


'Universal' detector spots AI deepfake videos with record accuracy

New Scientist

A universal deepfake detector has achieved the best accuracy yet in spotting multiple types of videos manipulated or completely generated by artificial intelligence. The technology may help flag non-consensual AI-generated pornography, deepfake scams or election misinformation videos. The widespread availability of cheap AI-powered deepfake creation tools has fuelled the out-of-control online spread of synthetic videos. Many depict women – including celebrities and even schoolgirls – in nonconsensual pornography. And deepfakes have also been used to influence political elections, as well as to enhance financial scams targeting both ordinary consumers and company executives. But most AI models trained to detect synthetic video focus on faces – which means they are most effective at spotting one specific type of deepfake, where a real person's face is swapped into an existing video.


GV-VAD : Exploring Video Generation for Weakly-Supervised Video Anomaly Detection

Cai, Suhang, Peng, Xiaohao, Wang, Chong, Cai, Xiaojie, Qian, Jiangbo

arXiv.org Artificial Intelligence

Video anomaly detection (VAD) plays a critical role in public safety applications such as intelligent surveillance. However, the rarity, unpredictability, and high annotation cost of real-world anomalies make it difficult to scale VAD datasets, which limits the performance and generalization ability of existing models. To address this challenge, we propose a generative video-enhanced weakly-supervised video anomaly detection (GV-VAD) framework that leverages text-conditioned video generation models to produce semantically controllable and physically plausible synthetic videos. These virtual videos augment the training data at low cost. In addition, a synthetic sample loss scaling strategy controls the influence of generated samples during training. Experiments show that the proposed framework outperforms state-of-the-art methods on the UCF-Crime dataset. The code is available at https://github.com/Sumutan/GV-VAD.git.
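
One plausible form of the synthetic sample loss scaling strategy is to down-weight the loss contribution of generated videos; the weighting scheme and value in this sketch are assumptions, not GV-VAD's exact formulation.

```python
# Illustrative sketch: samples drawn from generated videos contribute to the training
# loss with a reduced weight, so low-cost synthetic data augments training without
# dominating it. The weighting scheme and value are assumptions for illustration.
import torch
import torch.nn.functional as F

def weighted_anomaly_loss(scores, labels, is_synthetic, synthetic_weight=0.5):
    """
    scores:        (B,) predicted anomaly scores in [0, 1]
    labels:        (B,) weak video-level labels (1 = anomalous, 0 = normal)
    is_synthetic:  (B,) bool mask, True for generated videos
    """
    per_sample = F.binary_cross_entropy(scores, labels, reduction="none")
    weights = torch.where(is_synthetic,
                          torch.full_like(per_sample, synthetic_weight),
                          torch.ones_like(per_sample))
    return (weights * per_sample).sum() / weights.sum()

if __name__ == "__main__":
    scores = torch.tensor([0.9, 0.2, 0.7, 0.1])
    labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
    synth = torch.tensor([False, False, True, True])
    print(weighted_anomaly_loss(scores, labels, synth).item())
```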